Analysis of 140k T cells from cancer

In this notebook, we re-analize single-cell TCR/RNA-seq data from Wu et al. (2020) generated on the 10x Genomics platform. The original dataset consists of >140k T cells from 14 treatment-naive patients across four different types of cancer. Roughly 100k of the 140k cells have T-cell receptors.

0. Setup

In [1]:
%load_ext autoreload
%autoreload 2
import sys

sys.path.append("../../scirpy/")
import scirpy as ir
import pandas as pd
import numpy as np
import scanpy as sc
import scipy as sp
from matplotlib import pyplot as plt
import matplotlib
from weblogo.seq import SeqList, unambiguous_protein_alphabet
from weblogo import png_formatter, svg_formatter, eps_formatter
from weblogo import LogoData, LogoOptions, LogoFormat
from IPython.display import Image, display
sc.settings._vector_friendly = True

# whether to run the alignment or use a cached version.
# If `False` and the cached version is not available, will raise an error. 
run_alignment = True

This notebook was run with scirpy commit a73cf2bf46771eaf6072945d94f7a47fc3d80c52 and the following software versions:

In [2]:
sc.logging.print_versions()
scanpy==1.4.6 anndata==0.7.1 umap==0.3.10 numpy==1.18.1 scipy==1.4.1 pandas==0.25.3 scikit-learn==0.22.2.post1 statsmodels==0.11.1 python-igraph==0.8.0
In [3]:
def weblogo(seqs, title="", format="png"):
    """Draw a sequence logo from a list of amino acid sequences. """
    logodata = LogoData.from_seqs(SeqList(seqs, alphabet=unambiguous_protein_alphabet))
    logooptions = LogoOptions(logo_title=title)
    logoformat = LogoFormat(logodata, logooptions)
    if format == "png":
        display(Image(png_formatter(logodata, logoformat)))
    elif format == "eps":
        return eps_formatter(logodata, logoformat)
In [4]:
# colors from the paper
colors = {
    "Tumor+NAT expanded": "#9458a2",
    "Tumor singleton": "#ff8000",
    "NAT singleton": "#f7f719",
    "Tumor multiplet": "#eeb3cb",
    "NAT multiplet": "#9cd0de",
    "Blood singleton": "#cce70b",
    "Blood multiplet": "#beac83"
}

1. Preparing the data

The dataset ships with the scirpy package. We can conveniently load it from the dataset module.

In [5]:
adata = ir.datasets.wu2020()
In [6]:
adata.shape
Out[6]:
(141623, 30727)

We only keep the cells with TCR. ~96k cells remain.

In [7]:
adata = adata[adata.obs["has_tcr"] == "True", :]
adata = adata[~(adata.obs["cluster_orig"] == "nan"), :]
In [8]:
adata.shape
Out[8]:
(96067, 30727)
In [9]:
adata.obs.columns
Out[9]:
Index(['TRA_1_c_gene', 'TRA_1_cdr3', 'TRA_1_cdr3_nt', 'TRA_1_d_gene',
       'TRA_1_expr', 'TRA_1_j_gene', 'TRA_1_junction_ins', 'TRA_1_v_gene',
       'TRA_2_c_gene', 'TRA_2_cdr3', 'TRA_2_cdr3_nt', 'TRA_2_d_gene',
       'TRA_2_expr', 'TRA_2_j_gene', 'TRA_2_junction_ins', 'TRA_2_v_gene',
       'TRB_1_c_gene', 'TRB_1_cdr3', 'TRB_1_cdr3_nt', 'TRB_1_d_gene',
       'TRB_1_expr', 'TRB_1_j_gene', 'TRB_1_junction_ins', 'TRB_1_v_gene',
       'TRB_2_c_gene', 'TRB_2_cdr3', 'TRB_2_cdr3_nt', 'TRB_2_d_gene',
       'TRB_2_expr', 'TRB_2_j_gene', 'TRB_2_junction_ins', 'TRB_2_v_gene',
       'batch', 'clonotype_orig', 'cluster_orig', 'has_tcr', 'multi_chain',
       'patient', 'sample', 'source'],
      dtype='object')
In [10]:
adata.obs["counts"] = adata.X.sum(axis=1).A1
Trying to set attribute `.obs` of view, copying.

Preprocess Transcriptomics data

Transcriptomics data needs to be filtered and preprocessed as with any other single-cell dataset. We recommend following the scanpy tutorial and the best practice paper by Luecken et al.. For the Wu et al. (2020) dataset, the authors already provide clusters and UMAP coordinates. Instead of performing clustering and cluster annotation ourselves, we will just use provided data.

In [11]:
sc.pp.filter_genes(adata, min_cells=10)
sc.pp.filter_cells(adata, min_genes=100)
In [12]:
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1000)
sc.pp.log1p(adata)
In [13]:
adata.obsm["X_umap"] = adata.obsm["X_umap_orig"]
In [14]:
mapping = {
    "3.1-MT": "other",
    "4.1-Trm": "CD4_Trm",
    "4.2-RPL32": "CD4_RPL32",
    "4.3-TCF7": "CD4_TCF7",
    "4.4-FOS": "CD4_FOSS",
    "4.5-IL6ST": "CD4_IL6ST",
    "4.6a-Treg": "CD4_Treg",
    "4.6b-Treg": "CD4_Treg",
    "8.1-Teff": "CD8_Teff",
    "8.2-Tem": "CD8_Tem",
    "8.3a-Trm": "CD8_Trm",
    "8.3b-Trm": "CD8_Trm",
    "8.3c-Trm": "CD8_Trm",
    "8.4-Chrom": "other",
    "8.5-Mitosis": "other",
    "8.6-KLRB1": "other",
    "nan": "nan"
}
adata.obs["cluster"] = [mapping[x] for x in adata.obs["cluster_orig"]]

Let's inspect the UMAP plots. The first three panels show the UMAP plot colored by sample, patient and cluster. We don't observe any clustering of samples or patients that could hint at batch effects.

The lower three panels show the UMAP colored by the T cell markers CD8, CD4, and FOXP3. We can confirm that the markers correspond to their respective cluster labels.

In [15]:
sc.pl.umap(adata, color=["sample", "patient", "cluster", "CD8A", "CD4", "FOXP3"], ncols=2, wspace=.5)
... storing 'cluster' as categorical

TCR Quality Control

While most of T cell receptors have exactly one pair of α and β chains, up to one third of T cells can have dual TCRs, i.e. two pairs of receptors originating from different alleles (Schuldt et al (2019)).

Using the scirpy.tl.chain_pairing function, we can add a summary about the T cell receptor compositions to adata.obs.

  • Orphan chain refers to cells that have either a single alpha or beta receptor chain.
  • Extra chain refers to cells that have a full alpha/beta receptor pair, and an additional chain.
  • Multichain refers to cells with more than two receptor pairs detected. These cells are likely doublets.
In [16]:
%%time
ir.tl.chain_pairing(adata)
CPU times: user 10.5 s, sys: 24.5 ms, total: 10.5 s
Wall time: 10.5 s
In [17]:
ax = ir.pl.group_abundance(
    adata, groupby="chain_pairing", target_col="has_tcr", fraction="has_tcr"
) 
ax.set_ylabel("fraction of cells")
ax.set_xlabel("chain pairing")
ax.set_title("")
fig = ax.get_figure()
fig.savefig("figures/chain_pairing.svg")
In [18]:
sc.pl.umap(adata,
           color="chain_pairing", 
           groups=["Extra beta", "Extra alpha", "Two full chains"],
           size=[10 if x in ["Extra beta", "Extra alpha", "Two full chains"] else 3 for x in adata.obs["chain_pairing"]])
... storing 'chain_pairing' as categorical

Indeed, in this dataset, ~7% of cells have more than a one pair of productive T-cell receptors:

In [19]:
print("Fraction of cells with more than one pair of TCRs: {:.2f}".format(
    np.sum(adata.obs["chain_pairing"].isin(["Extra beta", "Extra alpha", "Two full chains"])) / adata.n_obs
))
Fraction of cells with more than one pair of TCRs: 0.07

Excluding multichain cells

Next, we visualize the Multichain cells on the UMAP plot and exclude them from downstream analysis. Multichain cells likely represent doublets. This is corroborated by the fact that they tend to have a very high number of detected transcripts.

In [20]:
_, pvalue = sp.stats.mannwhitneyu(adata.obs.loc[adata.obs["multi_chain"] == "True", "counts"].values, adata.obs.loc[adata.obs["multi_chain"] == "False", "counts"].values)
In [21]:
fig, (ax1, ax2) =plt.subplots(1,2, figsize=(8, 4), gridspec_kw={'width_ratios': [2, 1]})
sc.pl.umap(adata, color="chain_pairing", groups="Multichain", size=[30 if x == "Multichain" else 3 for x in adata.obs["chain_pairing"]], ax=ax1, legend_loc="none", show=False, frameon=False, title="Multichain UMAP")
sc.pl.violin(adata, "counts", "multi_chain", ax=ax2, show=False, inner="box", stripplot=False)
ax2.set_ylabel("detected reads")
ax2.set_xlabel("")
ax2.set_xticklabels(["other", "multichain"])
ax2.set_title("detected reads per cell")
ax2.text(.3, 50000, f"p={pvalue:.2E}")
plt.subplots_adjust(wspace=0.5)
fig.savefig("figures/multichains.svg", dpi=600)
In [22]:
adata.shape
Out[22]:
(96060, 17690)
In [23]:
adata = adata[adata.obs["chain_pairing"] != "Multichain", :].copy()
In [24]:
adata.shape
Out[24]:
(95586, 17690)

2. Define clonotypes

Defining clonotypes in scirpy is a two-step procedure:

  1. Computing a neighborhood graph based on CDR3 sequences
  2. Finding connected submodules in the neighborhood graph and annotating them as clonotypes

scirpy provides several metrics for creating the neighborhood graph. For instance, it is possible to choose between using nucleotide or amino acid CDR3 sequences, or using a sequence similarity metric based on multiple sequence alignments instead of requiring sequences to be identical.

In [25]:
sc.settings.verbosity = 4

identical nucleotide sequences

The authors of the dataset define the clonotypes on the nucleotide sequences and require all sequences of both receptor arms (and multiple chains in case of dual TCRs) to match.

In [26]:
%%time
ir.pp.tcr_neighbors(adata, receptor_arms="all", dual_tcr="all", cutoff=0, n_jobs=42, sequence="nt", key_added="tcr_neighbors_nt")
    Finished initalizing TcrNeighbors object. 
Finished computing TRA pairwise distances.
Finished computing TRB pairwise distances.
    Finished constructing coord-dictionary
Finished constructing cell x cell distance matrix. 
    Finished converting distances to connectivities. 
CPU times: user 1min 53s, sys: 4.22 s, total: 1min 58s
Wall time: 1min 58s
In [27]:
%%time
ir.tl.define_clonotypes(adata, neighbors_key="tcr_neighbors_nt", key_added="clonotype_nt")
CPU times: user 969 ms, sys: 142 ms, total: 1.11 s
Wall time: 1.11 s

identical amino acid sequences

Commonly, clonotypes are defined based on their amino acid sequence instead, because they recognize the same epitope. This is the scirpy default.

In [28]:
%%time
ir.pp.tcr_neighbors(adata, receptor_arms="all", dual_tcr="all", cutoff=0, n_jobs=42)
    Finished initalizing TcrNeighbors object. 
Finished computing TRA pairwise distances.
Finished computing TRB pairwise distances.
    Finished constructing coord-dictionary
Finished constructing cell x cell distance matrix. 
    Finished converting distances to connectivities. 
CPU times: user 1min 52s, sys: 3.96 s, total: 1min 56s
Wall time: 1min 56s
In [29]:
%%time
ir.tl.define_clonotypes(adata, neighbors_key="tcr_neighbors", key_added="clonotype")
CPU times: user 981 ms, sys: 123 ms, total: 1.1 s
Wall time: 1.1 s

similar amino acid sequences

With scirpy, it is possible to to one step further and summarize cells into the same clonotype that likely recognize the same epitope. This can be done by leveraging levenshtein or alignment distances. Here, we compute the alignment distance with a cutoff of 15, which is equivalent of three As mutating into R.

In [30]:
# adata.write_h5ad("./adata_in.h5ad")
In [31]:
%%time
if run_alignment:
    ir.pp.tcr_neighbors(adata, receptor_arms="all", dual_tcr="all", cutoff=15, key_added="tcr_neighbors_al15")
    adata.write_h5ad("./adata_alignment.h5ad")
else:
    # read cached version
    adata = sc.read_h5ad("./adata_alignment.h5ad")
    Finished initalizing TcrNeighbors object. 
100%|██████████| 31135/31135 [05:40<00:00, 91.53it/s] 
Finished computing TRA pairwise distances.
100%|██████████| 44074/44074 [11:03<00:00, 66.42it/s] 
Finished computing TRB pairwise distances.
    Finished constructing coord-dictionary
Finished constructing cell x cell distance matrix. 
    Finished converting distances to connectivities. 
... storing 'clonotype_nt' as categorical
... storing 'clonotype' as categorical
CPU times: user 8min 24s, sys: 1min 59s, total: 10min 24s
Wall time: 22min 21s
In [32]:
%%time
ir.tl.define_clonotypes(adata, neighbors_key="tcr_neighbors_al15", key_added="clonotype_alignment")
CPU times: user 1.16 s, sys: 205 ms, total: 1.37 s
Wall time: 1.38 s

Visualizing clonotype networks

To visualize the network we first call scirpy.tl.clonotype_network to compute the layout. We can then visualize it using scirpy.pl.clonotype_network.

The following plot visualizes all clonotypes with at least 50 cells. Each dot represents a cell, and each blob a clonotype. In the left panel each clonotype is labelled, in the right panel the clonotypes are colored by patient.

In [33]:
%%time
ir.tl.clonotype_network(adata, min_size=50, layout="components")
CPU times: user 10.8 s, sys: 166 ms, total: 11 s
Wall time: 11 s
In [34]:
ir.pl.clonotype_network(adata,
                        color=["clonotype", "patient"],
                        edges=False, size=50, ncols=2, 
                        legend_fontoutline=2,
                        legend_loc=["on data", "right margin"])
... storing 'clonotype_alignment' as categorical
Out[34]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x2b049cba6ed0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x2b049cb5d0d0>],
      dtype=object)

3. Clonotype consistency

Before we dive into the analysis of clonal expansion, we compare the different approaches of clonotype definition.

3.1 nucleotide based (scirpy vs. Wu et al.)

In this section, we compare the clonotypes assigned by scirpy assigned to the clonotypes assigned by the authors of the study. The original clonotypes are stored in the clonotype_orig column of obs.

According to the paper, the clonotypes are defined on nucleotide sequences and require sequence identity of all available chains. This should be equivalent of running scirpy.tcr_neighbors with dual_tcr="all" and receptor_arms="all".

To assess if the clonotype definitions are equivalent, we first check if there are clonotypes in clonotype_orig that contain multiple clonotypes according to scirpy's definition.

In [35]:
# the clonotypes in clonotype_orig that contain multiple clonotype_nt
nt_in_orig = adata.obs.groupby(["clonotype_orig", "clonotype_nt"]).size().reset_index().groupby("clonotype_orig").size().reset_index()
nt_in_orig[nt_in_orig[0] > 1]
Out[35]:
clonotype_orig 0

There are none!

Next, we check if there are clonotypes in clonotype_nt that contain multiple clonotypes according to the authors' definition.

In [36]:
# the clonotypes in clonotype_nt that contain multiple clonotype_orig
orig_in_nt = adata.obs.groupby(["clonotype_nt", "clonotype_orig"]).size().reset_index().groupby("clonotype_nt").size().reset_index()
with pd.option_context('display.max_rows', 8):
    display(orig_in_nt[orig_in_nt[0] > 1])
clonotype_nt 0
70 70 2
565 565 5
569 569 2
1548 1548 4
... ... ...
22351 22351 2
25854 25854 2
26240 26240 2
30141 30141 2

22 rows × 2 columns

There are a few!

Let's investigate that further:

In [37]:
with pd.option_context('display.max_rows', 10):
    display(adata.obs.loc[
        adata.obs["clonotype_nt"].isin(orig_in_nt[orig_in_nt[0] > 1]["clonotype_nt"]),
        ["clonotype_nt", "clonotype_orig", "patient", "TRA_1_cdr3_nt", "TRA_2_cdr3_nt", "TRB_1_cdr3_nt", "TRB_2_cdr3_nt"]
    ].sort_values(["clonotype_nt", "clonotype_orig"]))
clonotype_nt clonotype_orig patient TRA_1_cdr3_nt TRA_2_cdr3_nt TRB_1_cdr3_nt TRB_2_cdr3_nt
RT2_ACACCCTAGATATACG-1-0 70 renal2.tnb.C82 Renal2 TGTGCTGTGAGAGATCGCGACTACAAGCTCAGCTTT None None None
RB2_TAAGTGCTCAGTTTGG-1-12 70 renal2.tnb.C82 Renal2 TGTGCTGTGAGAGATCGCGACTACAAGCTCAGCTTT None None None
RN2_TACGGTATCTAACGGT-1-27 70 renal2.tnb.C82 Renal2 TGTGCTGTGAGAGATCGCGACTACAAGCTCAGCTTT None None None
RT3_GAACGGATCGTCTGCT-1-6 70 renal3.tnb.C320 Renal3 TGTGCTGTGAGAGATCGCGACTACAAGCTCAGCTTT None None None
CT1_AGTGGGAAGGTTCCTA-1-13 565 colon1.tn.C372 Colon1 TGTGCTGTGATGGATAGCAACTATCAGTTAATCTGG None None None
... ... ... ... ... ... ... ...
LT4_ATAACGCTCTATCCCG-2-17 25854 lung4.tn.C743 Lung4 TGTGCTGTGAGTGGTACCGGCACTGCCAGTAAACTCACCTTT None None None
ET1_GTTAAGCCACCACGTG-2-16 26240 endo1.tn.C3434 Endo1 None None TGTGCCAGCAGCTTAGTTGGGGCCAACGTCCTGACTTTC None
LN6_GGACGTCGTAAGAGGA-1-20 26240 lung6.tnb.C6956 Lung6 None None TGTGCCAGCAGCTTAGTTGGGGCCAACGTCCTGACTTTC None
LT6_AGCGTATGTAGAGCTG-1-18 30141 lung6.tnb.C829 Lung6 None None TGTGCCAGCTCACCACAACAAGAGACCCAGTACTTC None
RN2_TGGCTGGAGCTAGTCT-1-27 30141 renal2.tnb.C3707 Renal2 None None TGTGCCAGCTCACCACAACAAGAGACCCAGTACTTC None

102 rows × 7 columns

All cells of the same clonotype_nt have the same nucleotide sequences (apart from swapping between TRA_1 and TRA_2 or TRB_1 and TRB_2, respectively). Our method appears to work as expected. The inconsistencies seem to arise from the fact that the clonotypes have the same sequences, but originate from different patients. Apparently, the authors did not allow clonotypes from different patients (which makes sense when approaching clonotypes from a genomic point of view, not from an epitope recognition point of view).

When checking the number of clonotypes_orig per clonotype_nt and patient, we eradicate the differences:

In [38]:
# the clonotypes in clonotype_nt that contain multiple clonotype_orig
orig_in_nt = adata.obs.groupby(["clonotype_nt", "patient", "clonotype_orig"]).size().reset_index().groupby(["clonotype_nt", "patient"]).size().reset_index()
with pd.option_context('display.max_rows', 8):
    display(orig_in_nt[orig_in_nt[0] > 1])
clonotype_nt patient 0

Now, also the number of clonotypes is consistent:

In [39]:
assert adata.obs["clonotype_orig"].unique().size == adata.obs.groupby(["patient", "clonotype_nt"]).size().reset_index().shape[0]
print(f"""Number of clonotypes according to Wu et al.: {adata.obs["clonotype_orig"].unique().size}""")
print(f"""Number of clonotypes according to scirpy: {adata.obs.groupby(["clonotype_nt"]).size().reset_index().shape[0]}""")
print(f"""Number of clonotypes according to scirpy, within patient: {adata.obs.groupby(["patient", "clonotype_nt"]).size().reset_index().shape[0]}""")
Number of clonotypes according to Wu et al.: 54751
Number of clonotypes according to scirpy: 54719
Number of clonotypes according to scirpy, within patient: 54751

3.2 amino-acid vs. nucleotide-based

For a few clonotypes, we observe differences comparing the amino-accid vs. nucleotide-based approach for defining clonotypes. Clonotypes with the same amino acid sequence but different nucleotide sequences recognize the same antigen - but derive from different antedescent cells. This can be an example of convergent evolution.

In [40]:
ir.pl.clonotype_network(adata,
                        color=["clonotype", "clonotype_nt"],
                        edges=False, size=50, ncols=2, 
                        legend_fontoutline=2,
                        legend_loc=["on data", "none"])
Out[40]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x2b049caf5250>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x2b04956c6650>],
      dtype=object)

Let's find the cells that belong to clonotypes that have different nucleotide sequences but the same amino acid sequences:

In [41]:
nt_in_aa = adata.obs.groupby(["clonotype_nt", "clonotype"]).size().reset_index().groupby(["clonotype"]).size().reset_index()
convergent_clonotypes = nt_in_aa.loc[nt_in_aa[0] > 1, "clonotype"]
In [42]:
adata.obs["is_convergent"] = ["convergent" if x else "-" for x in adata.obs["clonotype"].isin(convergent_clonotypes)]
In [43]:
sc.pl.umap(adata, color="is_convergent", groups=["convergent"], size=[10 if x == "convergent" else 3 for x in adata.obs["is_convergent"]])
... storing 'is_convergent' as categorical

The phonomenon appears to primarily occur in CD8+ effector and tissue resident T cells:

In [44]:
ir.pl.group_abundance(adata, groupby="cluster", target_col="is_convergent", fraction=True)
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x2b04bba49790>

3.3 amino-acid identity vs. alignment distance

When computing the alignment distance, we allowed a distance of 15, based on the BLOSUM62 matrix. This is eqivalent of three As mutating into R.

The alignment distance allows us to identify similar CDR3 sequences, that likely recognize the same epitope.

For the following visualization, we exclude cells with Orphan chains. Having only one chain that needs to match there tend to be a lot of similar cells within a distance of 15 resulting in large, uninformative networks for those cells.

In [45]:
%%time
adata_sub = adata[~adata.obs["chain_pairing"].str.startswith("Orphan"), :]
CPU times: user 1min 1s, sys: 95.1 ms, total: 1min 1s
Wall time: 1min 1s
In [46]:
%%time
ir.tl.clonotype_network(adata_sub, min_size=50, layout="components", neighbors_key="tcr_neighbors_al15", key_clonotype_size="clonotype_alignment_size")
CPU times: user 9.26 s, sys: 639 ms, total: 9.9 s
Wall time: 9.9 s

We'll have a closer look at the following clonotypes, as they seem interesting:

In [47]:
selected_clonotypes = ["10411", "5304", "1626", "632"]
In [48]:
ir.pl.clonotype_network(adata_sub, color=["clonotype_alignment"], groups=selected_clonotypes, edges=False, size=50, ncols=2, legend_loc="on data", legend_fontoutline=3)
Out[48]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x2b051ebf68d0>],
      dtype=object)
In [49]:
ir.pl.clonotype_network(adata_sub, color=["clonotype", "patient"], edges=False, size=50, ncols=2, legend_loc=["none", "right margin"], legend_fontoutline=2)
Out[49]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x2b04b6413f10>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x2b04b6446250>],
      dtype=object)
In [50]:
%%capture
fig, ax = plt.subplots(figsize=(10, 10), dpi=300)
ir.pl.clonotype_network(adata_sub, color="clonotype", edges=False, size=50, legend_loc="none", ax=ax, frameon=False)
ax.set_title("clonotypes")
fig.savefig("figures/clonotype_network.svg")

The patient distribution of the four clonotypes:

In [51]:
ax = ir.pl.group_abundance(adata_sub[adata_sub.obs["clonotype_alignment"].isin(selected_clonotypes), :], groupby="clonotype_alignment", sort=selected_clonotypes, target_col="patient")
ax.xaxis.set_tick_params(rotation=0)
for tick in ax.xaxis.get_majorticklabels():
    tick.set_horizontalalignment("center")
ax.tick_params(labelsize=10)
fig = ax.get_figure()
ax.set_title("")
ax.set_xlabel("clonotype")
fig.savefig("figures/clonotype_network_patients.svg")

Sequence logos

In [52]:
for ct in selected_clonotypes:
    for chain, chain_label in zip(["TRA_1_cdr3", "TRB_1_cdr3"], ["alpha", "beta"]):
        with open(f"figures/logo_ct_{ct}_{chain_label}.eps", 'wb') as f:
            f.write(weblogo(adata.obs.loc[adata.obs["clonotype_alignment"] == ct, chain].values, title=f"clonotype {ct}: {chain_label}-chain", format="eps"))
In [53]:
for ct in selected_clonotypes:
    for chain, chain_label in zip(["TRA_1_cdr3", "TRB_1_cdr3"], ["alpha", "beta"]):
        weblogo(adata.obs.loc[adata.obs["clonotype_alignment"] == ct, chain].values, title=f"clonotype {ct}: {chain_label}-chain", format="png")

Clonotype 632 could is shared among multiple patients and could be specific for a common, viral antigen.

In case of the clonotype 1261 epitope (which we didin't highlight above) we find evidence in vdjdb that it could be specific for an Human Cytomegalievirus (CMV) antigen:

  • In vdjdb, we find a CMV-specific alpha-chain searching with the CAV[STR][LG]QAGTALIF pattern.
  • Searching for the beta-chain pattern does not yield a direct result. However, allowing up to two substitutions, we find a CMV-specific chain as well.

convergent clonotypes

We define a clonotype as being convergent, if there are different versions of nucleotide sequences for similar clonotypes, within the same patient.

We define convergence

  • on the level of amino-acid identity vs nucleotide identity
  • on the level of amino-acid similarity vs nucleotide identity
In [54]:
convergent_aa = adata.obs.groupby(["clonotype", "clonotype_alignment", "patient"]).size().reset_index().groupby(["clonotype_alignment", "patient"]).size().reset_index()
convergent_nt = adata.obs.groupby(["clonotype_nt", "clonotype_alignment", "patient"]).size().reset_index().groupby(["clonotype_alignment", "patient"]).size().reset_index()
convergent_clonotypes_aa = convergent_aa.loc[convergent_aa[0] > 1, "clonotype_alignment"].values
convergent_clonotypes_nt = convergent_nt.loc[convergent_nt[0] > 1, "clonotype_alignment"].values
In [55]:
# convergent clonotypes, AA identity vs alignment
adata.obs["convergent_aa"] = [str(x) for x in adata.obs["clonotype_alignment"].isin(convergent_clonotypes_aa) & ~adata.obs["chain_pairing"].str.startswith("Orphan")]
# convergent clonotypes, NN identity vs AA alignment
adata.obs["convergent_nt"] = [str(x) for x in adata.obs["clonotype_alignment"].isin(convergent_clonotypes_nt) & ~adata.obs["chain_pairing"].str.startswith("Orphan")]
In [56]:
sc.pl.umap(adata, color=["convergent_aa"], groups="True", size=[10 if x == "True" else 3 for x in adata.obs["convergent_aa"]])
... storing 'convergent_aa' as categorical
... storing 'convergent_nt' as categorical
In [57]:
sc.pl.umap(adata, color=["convergent_nt"], groups="True", size=[10 if x == "True" else 3 for x in adata.obs["convergent_nt"]])
In [58]:
%%capture
fig, ax = plt.subplots(figsize=(4,4), dpi=300)
sc.pl.umap(adata, color=["convergent_nt"], groups="True", size=[10 if x == "True" else 3 for x in adata.obs["convergent_nt"]], ax=ax, frameon=False, legend_loc="none", show=False)
ax.set_title("convergent clonotypes")
fig.savefig('figures/convergent_clonotypes_umap.svg')
In [59]:
ir.pl.group_abundance(adata, groupby="cluster_orig", target_col="convergent_nt", fraction=True, sort="alphabetical", fig_kws={"dpi": 120})
Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x2b0495b3e690>

Inspecting only converged clonotypes

In [60]:
adata_converged = adata[(adata.obs["convergent_nt"] == "True") & ~adata.obs["chain_pairing"].str.startswith("Orphan"), :]
In [61]:
%%time
ir.tl.clonotype_network(adata_converged, min_size=1, layout="components", neighbors_key="tcr_neighbors_al15", key_clonotype_size="clonotype_alignment_size")
CPU times: user 2.82 s, sys: 86.2 ms, total: 2.91 s
Wall time: 2.91 s
In [62]:
ir.pl.clonotype_network(adata_converged, color=["clonotype_alignment", "clonotype_nt"], edges=False, size=50, ncols=2, legend_loc=["none", "none"], legend_fontoutline=2)
Out[62]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x2b04957d1c10>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x2b049e70f090>],
      dtype=object)

4. Clonal expansion

In this section, we assess the clonal expansion

  • across patients
  • across clusters
  • across tissue sources

With the pl.clonal_expansion function, we can easily visualize the clonal expansion by different variables. Per default, the plot shows the fraction of cells belonging to an expanded clonotype.

clonal expansion across patients

In [63]:
fig, ax = plt.subplots(dpi=100)
ir.pl.clonal_expansion(adata, groupby="patient", summarize_by="cell", show_nonexpanded=True, ax=ax)
ax.tick_params(labelsize=10)
fig.savefig("figures/expansion_patients_cell.svg")

Alternatively, we can show the fraction of expanded clonotypes.

In [64]:
fig, ax = plt.subplots(dpi=100)
ir.pl.clonal_expansion(adata, groupby="patient", summarize_by="clonotype", show_nonexpanded=False, colors=["#FF7F0E", "#2CA02C"], ax=ax)
ax.tick_params(labelsize=10)
fig.savefig("figures/expansion_patients_clonotype.svg")
/home/sturm/.conda/envs/scirpy/lib/python3.7/site-packages/pandas/plotting/_matplotlib/core.py:203: UserWarning: 'colors' is being deprecated. Please use 'color'instead of 'colors'
  "'colors' is being deprecated. Please use 'color'"

In the paper, the authors state that depending on the patient, between 9 and 18% of clonotypes are expanded. Our results are highly consistent with these results:

In [65]:
fraction_expanded = ir.tl.summarize_clonal_expansion(adata, groupby="patient", summarize_by="clonotype", normalize=True).drop("1", axis="columns").sum(axis=1)
In [66]:
print(f"Between {min(fraction_expanded):.1%} and {max(fraction_expanded):.1%} of clonotypes are expanded.")
Between 9.3% and 19.0% of clonotypes are expanded.

clonal expansion across cell-type clusters

We observe that clonal expansion is highest among the CD8+ T-cell clusters, in particular effector and tissue-resident T cells.

In [67]:
ir.pl.clonal_expansion(adata, groupby="cluster_orig", summarize_by="cell", show_nonexpanded=True, fig_kws={"dpi": 120})
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x2b049e763290>

clonal expansion across tissue sources

The fraction of expanded cell is roughly equivalent among tumor and ajacent normal tissue, but lower in blood.

In [68]:
ir.pl.clonal_expansion(adata, groupby="source", summarize_by="cell", show_nonexpanded=True)
Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0x2b0496b71410>

5. Dual- and blood expanded clonotypes

Finally, we will divide the clonotypes into different categories, based on their expansion in blood, tissue and tumor samples.

In particular, we will

  • identify dual-expanded clonotypes, which are expanded in both adjacent tissue and tumor samples
  • identify blood-expanded clonotypes

and compare their abundance across cell-types and patients.

For an illustration, please refer to Fig. 1b of the Wu et al. (2020) paper.

In [69]:
clonotype_size_by_source = adata.obs.groupby(["patient", "source", "clonotype"], observed=True).size().reset_index(name="clonotype_count_by_source")
adata.obs = adata.obs.reset_index().merge(clonotype_size_by_source, how="left").set_index('index')
In [70]:
blood_expanded = []
for is_expanded, source in zip((adata.obs["source"] == "Blood") & (adata.obs["clonotype_count_by_source"] > 1), adata.obs["source"]):
    if source == "Blood":
        if is_expanded:
            blood_expanded.append("expanded")
        else:
            blood_expanded.append("not expanded")
    else:
        blood_expanded.append("independent")
In [71]:
adata.obs["blood_expanded"] = blood_expanded
In [72]:
clonotype_membership = {ct: list() for ct in adata.obs["clonotype"]}
for clonotype, source in zip(adata.obs["clonotype"], adata.obs["source"]):
    clonotype_membership[clonotype].append(source)
clonotype_membership = {ct: set(sources) for ct, sources in clonotype_membership.items()}
In [73]:
expansion_category = []
for clonotype, clonotype_size, source in zip(adata.obs["clonotype"], adata.obs["clonotype_size"], adata.obs["source"]):
    if clonotype_size == 1:
        if source == "Blood":
            expansion_category.append("Blood singleton")
        elif source == "NAT":
            expansion_category.append("NAT singleton")
        elif source == "Tumor":
            expansion_category.append("Tumor singleton")
    elif clonotype_size >1:
        membership = clonotype_membership[clonotype]
        if "Tumor" in membership and "NAT" in membership:
            expansion_category.append("Tumor+NAT expanded")
        elif "Tumor"in membership:
            expansion_category.append("Tumor multiplet")
        elif "NAT" in membership:
            expansion_category.append("NAT multiplet")
        elif "Blood" in membership:
            # these are *only* expanded in blood
            expansion_category.append("Blood multiplet")
            
assert len(expansion_category) == adata.n_obs       
In [74]:
adata.obs["cell_type"] = adata.obs["cluster_orig"].str[0]
adata.obs["tumor_type"] = adata.obs["patient"].str[:-1]
In [75]:
adata.obs["expansion_category"] = expansion_category
In [76]:
# make categorical and store colors
adata._sanitize()
adata.uns["expansion_category_colors"] = [colors[x] for x in adata.obs["expansion_category"].cat.categories]
... storing 'blood_expanded' as categorical
... storing 'cell_type' as categorical
... storing 'tumor_type' as categorical
... storing 'expansion_category' as categorical

Mostly CD8+ effector and tissue-resident T cells are blood-expanded and dual-expanded:

In [77]:
sc.pl.umap(adata, color="blood_expanded", groups=["expanded", "not expanded"], size=[15 if x in ["expanded", "not expanded"] else 3 for x in adata.obs["blood_expanded"]])
In [78]:
ir.pl.embedding(adata, basis="umap", color=["expansion_category", "cluster_orig"], legend_loc=["right margin", "on data"], show=True, legend_fontoutline=2, frameon=False, wspace=.5, panel_size=(6, 4))
In [79]:
%%capture
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8.5, 4), dpi=300)
sc.pl.umap(adata, color="expansion_category", legend_loc="none", show=False, ax=ax1, frameon=False)
sc.pl.umap(adata, color="cluster_orig", legend_loc="on data", show=False, ax=ax2, legend_fontoutline=2, frameon=False)
ax1.set_title("clonal expansion pattern")
ax2.set_title("cell-type cluster")
plt.subplots_adjust(wspace=.1)
fig.savefig("figures/clonal_expansion_umap.svg")
In [80]:
%%capture
fig, ax = plt.subplots(figsize=(4, 4), dpi=300)
sc.pl.umap(adata, color="cluster_orig", legend_loc="none", show=False, ax=ax, legend_fontoutline=2, frameon=False)
ax.set_title("")
fig.savefig("figures/cluster_orig.png")

Categories by cell type (fractions and count):

In [81]:
ir.pl.group_abundance(
    adata, groupby="cluster_orig", target_col="expansion_category", fraction=True, fig_kws={"dpi": 120}, sort="alphabetical"
)
Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x2b0495b3e9d0>
In [82]:
fig, ax = plt.subplots(dpi=120)
ir.pl.group_abundance(
    adata, groupby="cluster_orig", target_col="expansion_category", fraction=False, ax=ax, sort="alphabetical"
)
ax.set_title("")
ax.set_xlabel("cluster")
fig.savefig("figures/expansion_category_cluster.svg")

Categories by tumor type:

In [83]:
ir.pl.group_abundance(
    adata, groupby="tumor_type", target_col="expansion_category", fraction=True
)
Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x2b0496dc1f10>

The fraction of blood-expanded, blood non-expanded and blood-independent cells for the four patients with blood samples

In [84]:
ir.pl.group_abundance(
    adata[adata.obs["patient"].isin(["Lung6", "Renal1", "Renal2", "Renal3"]), :],
    groupby="patient",
    target_col="blood_expanded",
    fraction=True
)
Out[84]:
<matplotlib.axes._subplots.AxesSubplot at 0x2b0496da2850>

Tissue infiltration patterns. The bar plots show the distribution of cells by tissue expansion patterns from blood-indpendned, non-expanded and expanded clones.

In [85]:
for patient in ["Lung6", "Renal1", "Renal2", "Renal3"]:
    ax = ir.pl.group_abundance(
        adata[adata.obs["patient"] == patient,:],
        groupby="blood_expanded", 
        target_col="expansion_category", 
        fraction=True)
    ax.set_title(patient)

Tissue expansion patterns of T cell by cluster.

In [86]:
for subset in [["NAT singleton", "NAT multiplet"], ["Tumor singleton", "Tumor multiplet"]]:
    ax = ir.pl.group_abundance(
        adata[adata.obs["expansion_category"].isin(subset),:],
        groupby="cluster_orig", 
        target_col="expansion_category", 
        fraction=False,
        fig_kws={"dpi": 120},
        sort="alphabetical"
    )
    ax.set_title(subset[0].split()[0]) 
In [87]:
for source in ["NAT", "Tumor"]:
    ax = ir.pl.group_abundance(
        adata[(adata.obs["source"] == source) & (adata.obs["expansion_category"] == "Tumor+NAT expanded"),:],
        groupby="cluster_orig", 
        target_col="expansion_category", 
        fraction=False,
        fig_kws={"dpi": 120},
        sort="alphabetical"
    )
    ax.set_title(source)

6. Gene usage

In [88]:
ax = ir.pl.vdj_usage(adata, full_combination=False, top_n=30)
ax.set_ylabel("cells")
ax.set_xlabel("VDJ segment")
ax.set_title("VDJ usage")
fig = ax.get_figure()
fig.savefig("figures/vdj_usage.svg")
In [89]:
ax = ir.pl.spectratype(adata[adata.obs["TRB_1_j_gene"] != "None", :], groupby="TRB_1_cdr3", target_col="TRB_1_j_gene", fig_kws={"dpi": 120})
ax.xaxis.set_tick_params(rotation=90)
for tick in ax.xaxis.get_majorticklabels():
    tick.set_horizontalalignment("center")
ax.set_title("Spectratype of primary TCR-β chain")
ax.set_xlabel("CDR3 sequence length")
fig = ax.get_figure()
fig.savefig("figures/spectratype.svg")
In [ ]: